<!DOCTYPE html>
<html lang="en">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Projects to improve performance on IA-64</title>
<link rel="stylesheet" type="text/css" href="https://gcc.gnu.org/gcc.css" />
</head>

<body>
<h1>Projects to improve performance on IA-64</h1>
<!-- table of contents start -->
<h2 id="toc">Contents:</h2>

<ul>
<li><a href="#short_term_projects">Short-term projects</a></li>
<li><a href="#long_term_projects">Long-term and infrastructure projects</a></li>
<li><a href="#tool_projects"> Tools: performance tools, benchmarks, etc.</a></li>
</ul>
<!-- table of contents end -->

<p> This page lists projects that are expected to improve the performance
of the code that GCC generates for IA-64, more properly known as IPF
(Itanium Processor Family).
The lists originally came out of the GCC IA-64 Summit that was held
June 6, 2001, and many of the comments are from that summit.
Later updates are from discussions among people working in this area.
Additions and corrections are always welcome.</p>

<p>During the June 2001 summit, developers of proprietary IA-64 compilers
stressed that interactions between optimizations for IA-64 can be very
significant, more so than with other architectures.  People contributing
IA-64 improvements are highly encouraged to work closely with people
working on related improvements so that adverse interactions can be
detected early.</p>

<h2 id="short_term_projects">Short-term projects</h2>

<ul>
<li>Track memory origin to allow better alias analysis

<p>At the summit in June 2001, Ross Towle said that some optimizations
for IA-64 fall out nicely if data dependence information is as perfect
as it can be.  At that time GCC did not keep track of this information
well at all, and experienced GCC developers reported that
alias analysis in GCC during scheduling is extremely weak; it can even
lose track of which addresses are supposed to come from the stack frame.
It's weak in general and is even weaker in IA-64.  Alias analysis is a
general infrastructure problem; GCC has no knowledge of cross-block
scheduling.</p>

<p>Since then, Richard Kenner has checked in several patches to track
memory origins.  His changes link each MEM to the declaration it's from
so that alias analysis can know that two MEMs from different declarations
can't conflict.  This allows other things to be specified in a MEM, like
alignment.  He's also added functionality to prove that two MEMs cannot
conflict.</p>
</li>

<li>Make better use of alias information
<p>Now that better alias information is available, GCC should make use
of it.</p>

<p><em>What kinds of projects could now make use of Richard's memory origin
work?  Is the new information available during scheduling?  What other
optimizations could use it?</em></p>
</li>

<li>Data prefetch support

<p>At the GCC IA-64 Summit in June 2001, developers of other IA-64 compilers
said that optimizations involving compiler generated data prefetch are
important for IA-64 performance.</p>

<p>GCC 3.1 includes a prefetch RTL pattern that supports data prefetch on
a variety of GCC targets, a <code>__builtin_prefetch</code> function, and
the optimization <code>-fprefetch-loop-arrays</code>.  General information
about data prefetch and about data prefetch instructions supported by a
variety of GCC targets are described in the
<a href="prefetch.html">Data Prefetch Support</a> section of the Projects
list.</p>

<p>Janis Johnson is trying tweaks to the heuristics used for the
<code>-fprefetch-loop-arrays</code> optimization to try to get better
performance on IA-64.</p>
</li>

<li>Use existing dependence distance code
<p>There is dependence distance code already checked into the compiler that
no one uses.  That information could be hooked into the loop unroller
and the prefetcher.
For example, this is to check that references to two different
array elements within a loop iteration don't conflict.
See the code in <code>dependency.c</code> to see if
it uses the MEM tracking information and if the dependence distance code
itself is ever used in any loop optimization or could be used there.
This could also be
hooked up to the MEM info struct and used for iteration distance.
</p>
</li>

<li>Make better use of dependence information in scheduling
</li>

<li>Code locality; order functions based on profiling
<p>Code locality is even more important for this architecture than for
others where it shows a benefit.</p>

<p>There is an article by Carl Pettis and Bob Hansen about how to order
functions based on a call graph: "Profile guided code positioning",
http://acm.proxy.nova.edu/pubs/articles/proceedings/pldi/93542/p16-pettis/p16-pettis.pdf.</p>

<p>Steve Christiansen tried using gprof output to create a linker script
that orders functions based on run-time call graphs and call counts,
but couldn't show that it made a difference, based on SPEC CPU2000 results.</p>
</li>

<li>Code locality: exploit existing profile-directed block ordering
<p>Jan Hubicka, together with Richard Henderson and Andreas Jaeger,
made several changes to the profile-directed block ordering in GCC
for GCC 3.1.  This functionality
is available through <code>-fbranch-probabilities</code>
using data generated by first compiling with <code>-fprofile-arcs</code>.
This is described in
<a href="../news/profiledriven.html">Infrastructure for
Profile Driven Optimizations</a>, which also lists items for future work.</p>

<p>The following items came out of the June 2001 summit as issues to
investigate:</p>
<ul>
<li>GCC does not split a function into multiple regions, although that
has been mentioned as a possibility.</li>
<li>Profile information could be used to improve linearization of the code;
the CFG branch includes some support for trace scheduling.</li>
<li>Profile information could be used for if-conversion to decide which side
of the branch should be predicated.</li>
<li>Profile information is used for predication and delay slots.</li>
</ul>
</li>

<li>Code locality: static function ordering
<p>Look into SGI's tool CORD to determine whether its techniques can be
used with GCC.</p>
</li>

<li>Inlining: improve the heuristics used to guide inlining with -O3
<p>Some of this was done in the summer of 2001 and is in GCC 3.1.
There might be more work that could be done here.</p>
</li>

<li>Improve the machine model
<p>Validate that the machine model in GCC is accurate.  This would
be most useful when specific problems are noticed in generated code,
rather than making a full pass through it.</p>

<p>Look into incorporating information from Intel's KAPI library into
the machine model in GCC.</p>
</li>

<li>Improve GCC instruction bundling
<p>The machine model should guide instruction bundling, but currently it
is done using ad-hoc methods.</p>

<p>To evaluate instruction bundling, look at nop density.</p>
</li>

<li>Register allocator knowledge of hidden RSE costs
<p>The register allocator needs to know that there is some cost in
allocating additional stack registers because there's the danger of
hidden spilling in the Register Stack Engine (RSE) at the time of a
call.</p>
</li>

<li>Control speculation for loads
<p>This doesn't require recovery code and is quite simple,
with <code>chk.s</code>.</p>
</li>

<li>Straight-line post-increment
<p>Turning off the current support actually makes faster code for IA-64,
since it tends to create extra dependencies.  For it to be used effectively
post-increment could be generated after the second scheduling pass, with a
third pass then required.</p>

<p>Post-increment could be used when optimizing for size.</p>

<p>Exploit opportunities for non-loop induction variables.</p>
</li>

<li>Enable branch target alignment
<p>It's necessary to measure the trade-offs between alignment and code
size.</p>
</li>

<li>Align procedures
<p>This isn't turned on for IA-64; again, measure the trade-offs.</p>
</li>

<li>Tuning for Itanium 2
<p>Tuning for Itanium 2, controlled by <code>-mtune</code>, should be added.
</p>
</li>

<li>Double test conversion
<p>Jan Hubicka added support to the mainline (to become 3.2) to do
branch combining of chained branches having the same destination, with
hooks for target-specific tricks.  Such tricks are expected to be
worthwhile for IA-64; see the thread in the
<a href="https://gcc.gnu.org/ml/gcc-patches/2002-06/msg00026.html">
gcc-patches archives</a>.
</p>
</li>

</ul>

<h2 id="long_term_projects">Long-term and infrastructure projects</h2>

<ul>
<li>Region formation heuristics
<p>John Sias explains:
"Region formation is a way of coping with either limitations of the
machine or limitations of the compiler / compile time.  "Regions" are
control-flow-subgraphs, formed by various heuristics, usually to
perform transformations (i.e. hyperblock formation) or to do register
allocation or other work-intensive things.  For hyperblock formation,
for example, region formation heuristics are critical---selecting too
much unrelated code wastes resources; conversely, missing important
paths that interact well with each other defeats the purpose of the
transformation.  Large functions are sometimes broken heuristically
into regions for compilation, with the goal of reducing compile time."
</p>

<p>Richard Henderson says we could rip out the Haifa scheduler's CFG
detection, use regular data structures, and fix region detection.</p>
</li>

<li>Language-independent tree optimizations
<p>Now that the <a href="tree-ssa/">tree-ssa</a> branch has been merged
into mainline, we can perform cool optimizations that require more
information than is available in RTL.</p>
</li>

<li>Inlining: use profile information to guide inlining
<p>The infrastructure for this is not yet available.</p>
</li>

<li>High-level loop optimizations
<p>The <a href="tree-ssa/lno.html">lno-branch</a> can perform many
high-level loop optimisations.</p>
</li>

<li>Hyperblock scheduling
<p>This requires highly predicated code.</p>
</li>

<li>Predication
<p>There is little or no knowledge of predication outside of the if-cvt.c
file, so there are a number of optimization passes that are suboptimal
when predicated code is present.  None of the optimization passes up to
and including register allocation know how to handle predication from a
correctness standpoint.</p>

<ul>
<li>if-conversion</li>
<li>finding longer strings of logical</li>
<li>PQS (Predicate Query System)</li>
<li>disjoint predicates</li>
</ul>
<p>PQS is a database of known relationships between predicates.  It would
underlie predicate-aware dataflow, and therefore dependence drawing and
register allocation.</p>
</li>

<li>Data speculation
<p>Bernd Schmidt made an unsuccessful attempt to add data speculation.
Completing the patch won't be worthwhile until there is a sufficient
amount of ILP.</p>

<p>The IBM IA-64 compiler team saw code in important applications that
could have benefited from very local data speculation; see comments by
Jim McInnes in the minutes of the GCC IA-64 Summit.</p>
</li>

<li>Control speculation
<p>Control speculation is more important than data speculation.  It needs
cross-block scheduling, since the compiler doesn't see the opportunity
or need within a basic block.  Both require generating recovery code,
which introduces new instructions and new register definitions and uses.
It might be difficult to build in.</p>

<p>Some people at Red Hat tried unsuccessfully to tie control speculation
into the Haifa scheduler, but the effort showed that alias analysis in
GCC during scheduling is extremely weak.  One problem was that it
couldn't even tell which addresses are from the stack frame and so it
would speculate too much.  This project was tried quite quickly, though,
and with more time such a project might be successful.
Since then, Richard Kenner has added support for tracking memory origin,
so this might be more successful now.</p>

<p>Bernd Schmidt might have an unfinished patch that could be picked
up.</p>

<p>Stan Cox also had an unfinished control speculation patch.</p>
</li>

<li>Modulo scheduling
</li>

<li>Rotating registers
</li>

<li>Function splitting (moving function into two regions), for locality
<p>This is difficult if an exception is involved.</p>

<p>Dwarf2 is the only debugging format that can handle this.</p>
</li>

<li>Optimization of structures that are larger than a register
<p>The infrastructure doesn't currently handle this.  This is related to
memory optimizations.</p>
</li>

<li>Instruction prefetching
<p>Data prefetching is mentioned under short-term projects.
Instruction prefetching requires additional infrastructure.</p>
</li>

<li>Use of BBB template for multi-way branches (e.g. switches)
<p>It might be difficult to keep track of this in the machine-independent
part of GCC.</p>
</li>

<li>Cross-module optimizations
<p>Avoid reloads of GP when it is not necessary.  The compiler needs more
information than is currently available.</p>
</li>

<li>Register allocator handling GP as special
</li>

<li>C++ optimizations
<p>Jason Merrill invented cool stuff, e.g. thunks for multiple inheritance,
that hasn't been done yet.</p>

<p>It's possible to inline stubs.</p>
</li>

<li>"external" attribute or pragma
<p>This would be for information like DLL import/export; it is not
machine independent.</p>

<p>If GCC defined such an attribute, glibc would probably use it.</p>
</li>

</ul>

<h2 id="tool_projects">Tools: performance tools, benchmarks, etc.</h2>

<ul>
<li>Analyze benchmark results to identify important optimizations
<p>One of the projects identified at the GCC IA-64 Summit is measuring the
performance of GCC on IPF, comparing it to other IPF compilers, and
identifying the reasons for performance differences.  This would enable
the limited developer resources to be spent on those improvements that
are most likely to affect the performance of the applications that are
identified as being important.</p>

<p>This project can be broken up into a number of tasks that can be
performed by separate teams to best utilize the experience and strengths
of each team.</p>

<ul>
<li>Determine which benchmarks most accurately reflect the performance of
Linux system software and the enterprise applications that are most
likely to be used on IPF platforms (or software of interest to you).</li>
<li>Run the selected benchmarks on other architectures with GCC.</li>
<li>Run the selected benchmarks on Itanium with GCC and other IPF
compilers (Intel, HP, SGI).</li>
<li>Determine which significant sections of benchmark code show the worst
relative performance of GCC on Itanium.</li>
<li>Analyze the assembly code generated by GCC and by the proprietary
IPF compilers to determine, where possible, which optimizations would
most improve the performance of GCC code.</li>
<li>Pass on information to GCC developers about the relative value of
various short-term and long-term optimizations.</li>
</ul>
</li>

<li>Benchmark specific optimizations
<p>Run benchmarks with GCC for IPF with a variety of options for specific
optimizations to determine which ones should be included with gcc -O2.</p>
</li>

<li>Profile the Linux kernel
<p>Profile the kernel and look for hot spots where better code generation
or optimization would make a significant difference.</p>

<p>Gary Hade at IBM has been collecting profile and coverage data for
2.4.18 IA-64 Linux kernels built with prerelease versions of GCC 3.1.
Profile data collection utilizes the SGI-donated Kernprof facility.
Coverage data collection utilizes the IBM-donated GCOV kernel facility.
The data is being generated under various system loads including
parallel Linux kernel builds, AIM Suite VII Benchmark, and the
SPEC SDET Benchmark.  Much of the data has already been collected but
it still needs to be analyzed.</p>

<p>Information and code for the data collection facilities and workloads
mentioned above is available from these sources:</p>
<ul>
<li>Kernprof: http://oss.sgi.com/projects/kernprof/
</li>

<li>SPEC SDET (proprietary benchmark):
    <a href="http://www.spec.org/osg/sdm91/#sdet">
    description at specbench web site</a>;
    <a href="http://www.citi.umich.edu/projects/linux-scalability/reports/spec.html">
    description at citi web site</a>
</li>
</ul>
</li>

<li>Dispersal analysis
<p>Steve Christiansen wrote a dispersal analysis tool that uses objdump
disassembly output.  It uses McKinley rules and cannot be distributed
outside of IBM.</p>

<p>Gary Hade added Itanium 1 support to Steve's dispersal analysis
code and integrated the code into GNU Binutils source so it can be invoked
and controlled from objdump using IA-64 specific disassembler options.
The results of the optional dispersal analysis are added to the
disassembly output.  Gary submitted a patch supporting Itanium 1 to the
Binutils project via the bug-binutils mailing list on June 6, 2002.  An
update adding Itanium 2 support can be provided after Intel makes the
McKinley information public.</p>
</li>

<li>Statistics gathering tool
</li>

<li>PMU-based performance monitor
</li>

<li>Small test cases and sample codes for examining generated code
<p>Developers of proprietary IPF compilers who have identified key code
fragments from real applications where IPF optimizations make a big
difference could share these with GCC developers.</p>
</li>

<li>Compiler instrumentation that would cause an application to dump
performance counter information
</li>

<li>Fix profiling tools so they work with threads
<p>This would allow a tool that uses profiling output to order functions to
be used with a wider variety of applications.</p>
</li>

</ul>

</body>
</html>
